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Abstract 

Spanner is Google’s scalable, multi-version, globally- 
distributed, and synchronously-replicated database. It is 
the first system to distribute data at global scale and sup¬ 
port externally-consistent distributed transactions. This 
paper describes how Spanner is structured, its feature set, 
the rationale underlying various design decisions, and a 
novel time API that exposes clock uncertainty. This API 
and its implementation are critical to supporting exter¬ 
nal consistency and a variety of powerful features: non- 
blocking reads in the past, lock-free read-only transac¬ 
tions, and atomic schema changes, across all of Spanner. 


1 Introduction 


Spanner is a scalable, globally-distributed database de¬ 
signed, built, and deployed at Google. At the high¬ 
est level of abstraction, it is a database that shards data 
across many sets of Paxos \21 | state machines in data¬ 
centers spread all over the world. Replication is used for 
global availability and geographic locality; clients auto¬ 
matically failover between replicas. Spanner automati¬ 
cally reshards data across machines as the amount of data 
or the number of servers changes, and it automatically 
migrates data across machines (even across datacenters) 
to balance load and in response to failures. Spanner is 
designed to scale up to millions of machines across hun¬ 
dreds of datacenters and trillions of database rows. 


Applications can use Spanner for high availability, 
even in the face of wide-area natural disasters, by repli¬ 
cating their data within or even across continents. Our 
initial customer was FI (35) , a rewrite of Google’s ad¬ 
vertising backend. FI uses five replicas spread across 
the United States. Most other applications will probably 
replicate their data across 3 to 5 datacenters in one ge¬ 
ographic region, but with relatively independent failure 
modes. That is, most applications will choose lower la¬ 


tency over higher availability, as long as they can survive 
1 or 2 datacenter failures. 

Spanner’s main focus is managing cross-datacenter 
replicated data, but we have also spent a great deal of 
time in designing and implementing important database 
features on top of our distributed-systems infrastructure. 
Even though many projects happily use Bigtable we 
have also consistently received complaints from users 
that Bigtable can be difficult to use for some kinds of ap¬ 
plications: those that have complex, evolving schemas, 
or those that want strong consistency in the presence of 
wide-area replication. (Similar claims have been made 
by other authors (37) .) Many applications at Google 
have chosen to use Megastore (5j| because of its semi- 
relational data model and support for synchronous repli¬ 
cation, despite its relatively poor write throughput. As a 
consequence. Spanner has evolved from a Bigtable-like 
versioned key-value store into a temporal multi-version 
database. Data is stored in schematized semi-relational 
tables; data is versioned, and each version is automati¬ 
cally timestamped with its commit time; old versions of 
data are subject to configurable garbage-collection poli¬ 
cies; and applications can read data at old timestamps. 
Spanner supports general-purpose transactions, and pro¬ 
vides a SQL-based query language. 

As a globally-distributed database. Spanner provides 
several interesting features. First, the replication con¬ 
figurations for data can be dynamically controlled at a 
fine grain by applications. Applications can specify con¬ 
straints to control which datacenters contain which data, 
how far data is from its users (to control read latency), 
how far replicas are from each other (to control write la¬ 
tency), and how many replicas are maintained (to con¬ 
trol durability, availability, and read performance). Data 
can also be dynamically and transparently moved be¬ 
tween datacenters by the system to balance resource us¬ 
age across datacenters. Second, Spanner has two features 
that are difficult to implement in a distributed database: it 


To appear in Proceedings of OSDI2012 


1 



provides externally consistent reads and writes, and 
globally-consistent reads across the database at a time- 
stamp. These features enable Spanner to support con¬ 
sistent backups, consistent MapReduce executions [12], 
and atomic schema updates, all at global scale, and even 
in the presence of ongoing transactions. 

These features are enabled by the fact that Spanner as¬ 
signs globally-meaningful commit timestamps to trans¬ 
actions, even though transactions may be distributed. 
The timestamps reflect serialization order. In addition, 
the serialization order satisfies external consistency (or 
equivalently, linearizability |2()|): if a transaction T) 
commits before another transaction T 2 starts, then Ti’s 
commit timestamp is smaller than TVs. Spanner is the 
first system to provide such guarantees at global scale. 

The key enabler of these properties is a new TrueTime 
API and its implementation. The API directly exposes 
clock uncertainty, and the guarantees on Spanner’s times¬ 
tamps depend on the bounds that the implementation pro¬ 
vides. If the uncertainty is large. Spanner slows down to 
wait out that uncertainty. Google’s cluster-management 
software provides an implementation of the TrueTime 
API. This implementation keeps uncertainty small (gen¬ 
erally less than 10ms) by using multiple modern clock 
references (GPS and atomic clocks). 

Section [2] describes the structure of Spanner’s imple¬ 
mentation, its feature set, and the engineering decisions 
that went into their design. Section[3]describes our new 
TrueTime API and sketches its implementation. Sec¬ 
tion [4] describes how Spanner uses TrueTime to imple¬ 
ment externally-consistent distributed transactions, lock- 
free read-only transactions, and atomic schema updates. 
Section [5] provides some benchmarks on Spanner’s per¬ 
formance and TrueTime behavior, and discusses the ex¬ 
periences of FI. Sections [6][7] and[8]describe related and 
future work, and summarize our conclusions. 
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Figure 1: Spanner server organization. 


servers Q. Zones are the unit of administrative deploy¬ 
ment. The set of zones is also the set of locations across 
which data can be replicated. Zones can be added to or 
removed from a running system as new datacenters are 
brought into service and old ones are turned off, respec¬ 
tively. Zones are also the unit of physical isolation: there 
may be one or more zones in a datacenter, for example, 
if different applications’ data must be partitioned across 
different sets of servers in the same datacenter. 

Figure [I] illustrates the servers in a Spanner universe. 
A zone has one zonemaster and between one hundred 
and several thousand spanservers. The former assigns 
data to spanservers; the latter serve data to clients. The 
per-zone location proxies are used by clients to locate 
the spanservers assigned to serve their data. The uni¬ 
verse master and the placement driver are currently sin¬ 
gletons. The universe master is primarily a console that 
displays status information about all the zones for inter¬ 
active debugging. The placement driver handles auto¬ 
mated movement of data across zones on the timescale 
of minutes. The placement driver periodically commu¬ 
nicates with the spanservers to find data that needs to be 
moved, either to meet updated replication constraints or 
to balance load. For space reasons, we will only describe 
the spanserver in any detail. 


2 Implementation 

This section describes the structure of and rationale un¬ 
derlying Spanner’s implementation. It then describes the 
directory abstraction, which is used to manage replica¬ 
tion and locality, and is the unit of data movement. Fi¬ 
nally, it describes our data model, why Spanner looks 
like a relational database instead of a key-value store, and 
how applications can control data locality. 

A Spanner deployment is called a universe. Given 
that Spanner manages data globally, there will be only 
a handful of running universes. We currently run a 
test/playground universe, a development/production uni¬ 
verse, and a production-only universe. 

Spanner is organized as a set of zones, where each 
zone is the rough analog of a deployment of Bigtable 


2.1 Spanserver Software Stack 

This section focuses on the spanserver implementation 
to illustrate how replication and distributed transactions 
have been layered onto our Bigtable-based implementa¬ 
tion. The software stack is shown in Figure [2] At the 
bottom, each spanserver is responsible for between 100 
and 1000 instances of a data structure called a tablet. A 
tablet is similar to Bigtable’s tablet abstraction, in that it 
implements a bag of the following mappings: 

(key:string, timestamp:int64) string 

Unlike Bigtable, Spanner assigns timestamps to data, 
which is an important way in which Spanner is more 
like a multi-version database than a key-value store. A 
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Figure 2: Spanserver software stack. 

tablet’s state is stored in set of B-tree-like files and a 
write-ahead log, all on a distributed file system called 
Colossus (the successor to the Google File System ED)- 

To support replication, each spanserver implements a 
single Paxos state machine on top of each tablet. (An 
early Spanner incarnation supported multiple Paxos state 
machines per tablet, which allowed for more flexible 
replication configurations. The complexity of that de¬ 
sign led us to abandon it.) Each state machine stores 
its metadata and log in its corresponding tablet. Our 
Paxos implementation supports long-lived leaders with 
time-based leader leases, whose length defaults to 10 
seconds. The current Spanner implementation logs ev¬ 
ery Paxos write twice: once in the tablet’s log, and once 
in the Paxos log. This choice was made out of expedi¬ 
ency, and we are likely to remedy this eventually. Our 
implementation of Paxos is pipelined, so as to improve 
Spanner’s throughput in the presence of WAN latencies; 
but writes are applied by Paxos in order (a fact on which 
we will depend in Section[4|. 

The Paxos state machines are used to implement a 
consistently replicated bag of mappings. The key-value 
mapping state of each replica is stored in its correspond¬ 
ing tablet. Writes must initiate the Paxos protocol at the 
leader; reads access state directly from the underlying 
tablet at any replica that is sufficiently up-to-date. The 
set of replicas is collectively a Paxos group. 

At every replica that is a leader, each spanserver im¬ 
plements a lock table to implement concurrency control. 
The lock table contains the state for two-phase lock¬ 
ing: it maps ranges of keys to lock states. (Note that 
having a long-lived Paxos leader is critical to efficiently 
managing the lock table.) In both Bigtable and Span¬ 
ner, we designed for long-lived transactions (for exam¬ 
ple, for report generation, which might take on the order 
of minutes), which perform poorly under optimistic con¬ 
currency control in the presence of conflicts. Operations 



Figure 3: Directories are the unit of data movement between 
Paxos groups. 

that require synchronization, such as transactional reads, 
acquire locks in the lock table; other operations bypass 
the lock table. 

At every replica that is a leader, each spanserver also 
implements a transaction manager to support distributed 
transactions. The transaction manager is used to imple¬ 
ment a participant leader, the other replicas in the group 
will be referred to as participant slaves. If a transac¬ 
tion involves only one Paxos group (as is the case for 
most transactions), it can bypass the transaction manager, 
since the lock table and Paxos together provide transac- 
tionality. If a transaction involves more than one Paxos 
group, those groups’ leaders coordinate to perform two- 
phase commit. One of the participant groups is chosen as 
the coordinator: the participant leader of that group will 
be referred to as the coordinator leader, and the slaves of 
that group as coordinator slaves. The state of each trans¬ 
action manager is stored in the underlying Paxos group 
(and therefore is replicated). 

2.2 Directories and Placement 

On top of the bag of key-value mappings, the Spanner 
implementation supports a bucketing abstraction called a 
directory, which is a set of contiguous keys that share a 
common prefix. (The choice of the term directory is a 
historical accident; a better term might be bucket.) We 
will explain the source of that prefix in Section [23| Sup¬ 
porting directories allows applications to control the lo¬ 
cality of their data by choosing keys carefully. 

A directory is the unit of data placement. All data in 
a directory has the same replication configuration. When 
data is moved between Paxos groups, it is moved direc¬ 
tory by directory, as shown in Figure [3] Spanner might 
move a directory to shed load from a Paxos group; to put 
directories that are frequently accessed together into the 
same group; or to move a directory into a group that is 
closer to its accessors. Directories can be moved while 
client operations are ongoing. One could expect that a 
50MB directory can be moved in a few seconds. 

The fact that a Paxos group may contain multiple di¬ 
rectories implies that a Spanner tablet is different from 
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a Bigtable tablet: the former is not necessarily a single 
lexicographically contiguous partition of the row space. 
Instead, a Spanner tablet is a container that may encap¬ 
sulate multiple partitions of the row space. We made this 
decision so that it would be possible to colocate multiple 
directories that are frequently accessed together. 

Movedir is the background task used to move direc¬ 
tories between Paxos groups G3- Movedir is also used 
to add or remove replicas to Paxos groups 1251, be¬ 
cause Spanner does not yet support in-Paxos configura¬ 
tion changes. Movedir is not implemented as a single 
transaction, so as to avoid blocking ongoing reads and 
writes on a bulky data move. Instead, movedir registers 
the fact that it is starting to move data and moves the data 
in the background. When it has moved all but a nominal 
amount of the data, it uses a transaction to atomically 
move that nominal amount and update the metadata for 
the two Paxos groups. 

A directory is also the smallest unit whose geographic- 
replication properties (or placement , for short) can 
be specified by an application. The design of our 
placement-specification language separates responsibil¬ 
ities for managing replication configurations. Adminis¬ 
trators control two dimensions: the number and types of 
replicas, and the geographic placement of those replicas. 
They create a menu of named options in these two di¬ 
mensions (e.g.. North America, replicated 5 ways with 
1 witness). An application controls how data is repli¬ 
cated, by tagging each database and/or individual direc¬ 
tories with a combination of those options. For example, 
an application might store each end-user’s data in its own 
directory, which would enable user A’s data to have three 
replicas in Europe, and user fi’s data to have five replicas 
in North America. 

For expository clarity we have over-simplified. In fact, 
Spanner will shard a directory into multiple fragments 
if it grows too large. Fragments may be served from 
different Paxos groups (and therefore different servers). 
Movedir actually moves fragments, and not whole direc¬ 
tories, between groups. 


2.3 Data Model 

Spanner exposes the following set of data features 
to applications: a data model based on schematized 
semi-relational tables, a query language, and general- 
purpose transactions. The move towards support¬ 
ing these features was driven by many factors. The 
need to support schematized semi-relational tables and 
synchronous replication is supported by the popular¬ 
ity of Megastore ||5j. At least 300 applications within 
Google use Megastore (despite its relatively low per¬ 
formance) because its data model is simpler to man¬ 


age than Bigtable’s, and because of its support for syn¬ 
chronous replication across datacenters. (Bigtable only 
supports eventually-consistent replication across data¬ 
centers.) Examples of well-known Google applications 
that use Megastore are Gmail, Picasa, Calendar, Android 
Market, and AppEngine. The need to support a SQL- 
like query language in Spanner was also clear, given 
the popularity of Dremel [28j as an interactive data- 
analysis tool. Finally, the lack of cross-row transactions 
in Bigtable led to frequent complaints; Percolator 1321 
was in part built to address this failing. Some authors 
have claimed that general two-phase commit is too ex¬ 
pensive to support, because of the performance or avail¬ 
ability problems that it brings SUM- We believe it 
is better to have application programmers deal with per¬ 
formance problems due to overuse of transactions as bot¬ 
tlenecks arise, rather than always coding around the lack 
of transactions. Running two-phase commit over Paxos 
mitigates the availability problems. 


The application data model is layered on top of the 
directory-bucketed key-value mappings supported by the 
implementation. An application creates one or more 
databases in a universe. Each database can contain an 
unlimited number of schematized tables. Tables look 
like relational-database tables, with rows, columns, and 
versioned values. We will not go into detail about the 
query language for Spanner. It looks like SQL with some 
extensions to support protocol-buffer-valued fields. 


Spanner’s data model is not purely relational, in that 
rows must have names. More precisely, every table is re¬ 
quired to have an ordered set of one or more primary-key 
columns. This requirement is where Spanner still looks 
like a key-value store: the primary keys form the name 
for a row, and each table defines a mapping from the 
primary-key columns to the non-primary-key columns. 
A row has existence only if some value (even if it is 
NULL) is defined for the row’s keys. Imposing this struc¬ 
ture is useful because it lets applications control data lo¬ 
cality through their choices of keys. 

Figure[4]contains an example Spanner schema for stor¬ 
ing photo metadata on a per-user, per-album basis. The 
schema language is similar to Megastore’s, with the ad¬ 
ditional requirement that every Spanner database must 
be partitioned by clients into one or more hierarchies 
of tables. Client applications declare the hierarchies in 
database schemas via the INTERLEAVE IN declara¬ 
tions. The table at the top of a hierarchy is a directory 
table. Each row in a directory table with key K , together 
with all of the rows in descendant tables that start with K 
in lexicographic order, forms a directory. ON DELETE 
CASCADE says that deleting a row in the directory table 
deletes any associated child rows. The figure also illus¬ 
trates the interleaved layout for the example database: for 
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CREATE TABLE Users { 

uid INT64 NOT NULL, email STRING 

} PRIMARY KEY (uid), DIRECTORY; 

CREATE TABLE Albums { 

uid INT64 NOT NULL, aid INT64 NOT NULL, 
name STRING 

) PRIMARY KEY (uid, aid) , 

INTERLEAVE IN PARENT Users ON DELETE CASCADE; 



Users(l) 

Albums(l.l) 

Albumsfl ,2) 


Users(2) 

Albums(2,1) 

Albums(2.2) 

Albums(2.3) 


Directory 3665 


Directory 453 


Figure 4: Example Spanner schema for photo metadata, and 
the interleaving implied by INTERLEAVE IN. 


example. Albums (2,1) represents the row from the 
Albums table for user_id 2, album_id 1. This 
interleaving of tables to form directories is significant 
because it allows clients to describe the locality relation¬ 
ships that exist between multiple tables, which is nec¬ 
essary for good performance in a sharded, distributed 
database. Without it. Spanner would not know the most 
important locality relationships. 

3 TrueTime 


Method 

Returns 

TT.nowQ 

TTinterval : [ earliest , latest ] 

TT.afterQ ) 

true if t has definitely passed 

TT.beforeQ) 

true if t has definitely not arrived 


Table 1: TrueTime API. The argument t is of type TTstamp. 

This section describes the TrueTime API and sketches 
its implementation. We leave most of the details for an¬ 
other paper: our goal is to demonstrate the power of 
having such an API. Table [T] lists the methods of the 
API. TrueTime explicitly represents time as a TTinterval, 
which is an interval with bounded time uncertainty (un¬ 
like standard time interfaces that give clients no notion 
of uncertainty). The endpoints of a TTinterval are of 
type TTstamp. The TT.nowQ method returns a TTinterval 
that is guaranteed to contain the absolute time during 
which TT.nowQ was invoked. The time epoch is anal¬ 
ogous to UNIX time with leap-second smearing. De¬ 
fine the instantaneous error bound as e, which is half of 
the interval’s width, and the average error bound as e. 
The TT.afterQ and TT.beforeQ methods are convenience 
wrappers around TT.nowQ. 


Denote the absolute time of an event e by the func¬ 
tion fflfo s (e). In more formal terms, TrueTime guaran¬ 
tees that for an invocation tt = TT.nowQ, tt.earliest < 
t a bs(e n ow) < tt.latest, where e now is the invocation event. 

The underlying time references used by TrueTime 
are GPS and atomic clocks. TrueTime uses two forms 
of time reference because they have different failure 
modes. GPS reference-source vulnerabilities include an¬ 
tenna and receiver failures, local radio interference, cor¬ 
related failures (e.g., design faults such as incorrect leap- 
second handling and spoofing), and GPS system outages. 
Atomic clocks can fail in ways uncorrelated to GPS and 
each other, and over long periods of time can drift signif¬ 
icantly due to frequency error. 

TrueTime is implemented by a set of time master ma¬ 
chines per datacenter and a timeslave daemon per ma¬ 
chine. The majority of masters have GPS receivers with 
dedicated antennas; these masters are separated physi¬ 
cally to reduce the effects of antenna failures, radio in¬ 
terference, and spoofing. The remaining masters (which 
we refer to as Armageddon masters ) are equipped with 
atomic clocks. An atomic clock is not that expensive: 
the cost of an Armageddon master is of the same order 
as that of a GPS master. All masters’ time references 
are regularly compared against each other. Each mas¬ 
ter also cross-checks the rate at which its reference ad¬ 
vances time against its own local clock, and evicts itself 
if there is substantial divergence. Between synchroniza¬ 
tions, Armageddon masters advertise a slowly increasing 
time uncertainty that is derived from conservatively ap¬ 
plied worst-case clock drift. GPS masters advertise un¬ 
certainty that is typically close to zero. 

Every daemon polls a variety of masters (29) to re¬ 
duce vulnerability to errors from any one master. Some 
are GPS masters chosen from nearby datacenters; the 
rest are GPS masters from farther datacenters, as well 
as some Armageddon masters. Daemons apply a variant 
of Marzullo’s algorithm (27) to detect and reject liars, 
and synchronize the local machine clocks to the non¬ 
liars. To protect against broken local clocks, machines 
that exhibit frequency excursions larger than the worst- 
case bound derived from component specifications and 
operating environment are evicted. 

Between synchronizations, a daemon advertises a 
slowly increasing time uncertainty, e is derived from 
conservatively applied worst-case local clock drift, e also 
depends on time-master uncertainty and communication 
delay to the time masters. In our production environ¬ 
ment, e is typically a sawtooth function of time, varying 
from about 1 to 7 ms over each poll interval, e is there¬ 
fore 4 ms most of the time. The daemon’s poll interval is 
currently 30 seconds, and the current applied drift rate is 
set at 200 microseconds/second, which together account 
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Snapshot Read, client-provided timestamp 

— 

lock-free 

any, subject to § 

4.1.3 


Snapshot Read, client-provided bound 

§ 

4.1.3 


lock-free 

any, subject to § 

4.1.3 



Table 2: Types of reads and writes in Spanner, and how they compare. 


for the sawtooth bounds from 0 to 6 ms. The remain¬ 
ing 1 ms comes from the communication delay to the 
time masters. Excursions from this sawtooth are possi¬ 
ble in the presence of failures. For example, occasional 
time-master unavailability can cause datacenter-wide in¬ 
creases in e. Similarly, overloaded machines and network 
links can result in occasional localized e spikes. 

4 Concurrency Control 

This section describes how TrueTime is used to guaran¬ 
tee the correctness properties around concurrency con¬ 
trol, and how those properties are used to implement 
features such as externally consistent transactions, lock- 
free read-only transactions, and non-blocking reads in 
the past. These features enable, for example, the guar¬ 
antee that a whole-database audit read at a timestamp t 
will see exactly the effects of every transaction that has 
committed as of t. 

Going forward, it will be important to distinguish 
writes as seen by Paxos (which we will refer to as Paxos 
writes unless the context is clear) from Spanner client 
writes. For example, two-phase commit generates a 
Paxos write for the prepare phase that has no correspond¬ 
ing Spanner client write. 

4.1 Timestamp Management 

Table [2] lists the types of operations that Spanner sup¬ 
ports. The Spanner implementation supports read- 
write transactions, read-only transactions (predeclared 
snapshot-isolation transactions), and snapshot reads. 
Standalone writes are implemented as read-write trans¬ 
actions; non-snapshot standalone reads are implemented 
as read-only transactions. Both are internally retried 
(clients need not write their own retry loops). 

A read-only transaction is a kind of transaction that 
has the performance benefits of snapshot isolation ©• 
A read-only transaction must be predeclared as not hav¬ 
ing any writes; it is not simply a read-write transaction 
without any writes. Reads in a read-only transaction ex¬ 
ecute at a system-chosen timestamp without locking, so 
that incoming writes are not blocked. The execution of 


the reads in a read-only transaction can proceed on any 
replica that is sufficiently up-to-date (Section 4.1.3| >. 

A snapshot read is a read in the past that executes with¬ 
out locking. A client can either specify a timestamp for a 
snapshot read, or provide an upper bound on the desired 
timestamp’s staleness and let Spanner choose a time- 
stamp. In either case, the execution of a snapshot read 
proceeds at any replica that is sufficiently up-to-date. 

For both read-only transactions and snapshot reads, 
commit is inevitable once a timestamp has been cho¬ 
sen, unless the data at that timestamp has been garbage- 
collected. As a result, clients can avoid buffering results 
inside a retry loop. When a server fails, clients can inter¬ 
nally continue the query on a different server by repeat¬ 
ing the timestamp and the current read position. 


4.1.1 Paxos Leader Leases 

Spanner’s Paxos implementation uses timed leases to 
make leadership long-lived (10 seconds by default). A 
potential leader sends requests for timed lease votes', 
upon receiving a quorum of lease votes the leader knows 
it has a lease. A replica extends its lease vote implicitly 
on a successful write, and the leader requests lease-vote 
extensions if they are near expiration. Define a leader’s 
lease interval as starting when it discovers it has a quo¬ 
rum of lease votes, and as ending when it no longer has 
a quorum of lease votes (because some have expired). 
Spanner depends on the following disjointness invariant: 
for each Paxos group, each Paxos leader’s lease interval 
is disjoint from every other leader’s. Appendix [A] de¬ 
scribes how this invariant is enforced. 

The Spanner implementation permits a Paxos leader 
to abdicate by releasing its slaves from their lease votes. 
To preserve the disjointness invariant. Spanner constrains 
when abdication is permissible. Define s max to be the 
maximum timestamp used by a leader. Subsequent sec¬ 
tions will describe when s max is advanced. Before abdi¬ 
cating, a leader must wait until TT.after(s max ) is true. 

4.1.2 Assigning Timestamps to RW Transactions 

Transactional reads and writes use two-phase locking. 
As a result, they can be assigned timestamps at any time 
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when all locks have been acquired, but before any locks 
have been released. For a given transaction. Spanner as¬ 
signs it the timestamp that Paxos assigns to the Paxos 
write that represents the transaction commit. 

Spanner depends on the following monotonicity in¬ 
variant: within each Paxos group, Spanner assigns times¬ 
tamps to Paxos writes in monotonically increasing or¬ 
der, even across leaders. A single leader replica can triv¬ 
ially assign timestamps in monotonically increasing or¬ 
der. This invariant is enforced across leaders by making 
use of the disjointness invariant: a leader must only as¬ 
sign timestamps within the interval of its leader lease. 
Note that whenever a timestamp s is assigned, s max is 
advanced to s to preserve disjointness. 

Spanner also enforces the following external- 
consistency invariant: if the start of a transaction T 2 
occurs after the commit of a transaction Tj, then the 
commit timestamp of T 2 must be greater than the 
commit timestamp of Xj. Define the start and commit 
events for a trans action T, by ef art and e c ° mn, "\ and 
the commit timestamp of a transaction 7) by s,. The 
invariant becomes t a b s (e c ° mm,t ) < t a bs{e2° rt ) => si < S2. 
The protocol for executing transactions and assigning 
timestamps obeys two rules, which together guarantee 
this invariant, as shown below. Define the arrival event 
of the commit request at the coordinator leader for a 
write Ti to be ef ' n ' er . 

Start The coordinator leader for a write T) assigns 
a commit timestamp Sj no less than the value of 
TT.now().latest, computed after ef n ’ er . Note that the 
participant leaders do not matter here; Section 4.2.1| de- 
scribes how they are involved in the implementation of 
the next rule. 

Commit Wait The coordinator leader ensures that 
clients cannot see any data committed by Ti until 
TT.after(si) is true. Commit wait ensures that s, is 
less than the absolute commit time of 7), or s t < 
The implementation of commit wait is de¬ 
scribed in Section !?. 2. II Proof: 


si < t abs {eT mmi ’) 

tabsyex ) < t a b S [e 2 ) 

tabs{eT) < t abs (er er ) 
tabs(er er ) < S2 
Si < s 2 


(commit wait) 
(assumption) 
(causality) 
(start) 
(transitivity) 


maximum timestamp at which a replica is up-to-date. A 
replica can satisfy a read at a timestamp / if t <= t sa f e . 

safe = rnin(t'’“f° s , t™ e ), where each Paxos 


Define t 


state machine has a safe time t p s “f° s and each transac¬ 
tion manager has a safe time t™ e . t p s ^° s is simpler: it 
is the timestamp of the highest-applied Paxos write. Be¬ 
cause timestamps increase monotonically and writes are 
applied in order, writes will no longer occur at or below 
tmf° s W i’ ; h res P e ct to Paxos. 

t™ e is 00 at a replica if there are zero prepared (but 
not committed) transactions—that is, transactions in be¬ 
tween the two phases of two-phase commit. (For a par¬ 
ticipant slave, t™ e actually refers to the replica’s leader’s 
transaction manager, whose state the slave can infer 
through metadata passed on Paxos writes.) If there are 
any such transactions, then the state affected by those 
transactions is indeterminate: a participant replica does 
not know yet whether such transactions will commit. As 


we discuss in Section 4.2.1 the commit protocol ensures 
that every participant knows a lower bound on a pre¬ 
pared transaction’s timestamp. Every participant leader 
(for a group g) for a transaction 7) assigns a prepare 
timestamp s prepare to its prepare record. The coordinator 
leader ensures that the transaction’s commit timestamp 
Si >= s prepare over all participant groups g. Therefore, 
for every replica in a group g, over all transactions Ti pre¬ 
pared at g, t™ e = mirii(s p i re g pare ) — 1 over all transactions 
prepared at g. 


4.1.4 Assigning Timestamps to RO Transactions 

A read-only transaction executes in two phases: assign 
a timestamp s rea d |8|, and then execute the transaction’s 
reads as snapshot reads at s reac i■ The snapshot reads can 
execute at any replicas that are sufficiently up-to-date. 

The simple assignment of s rea d = TT.nowQ.latest, at 
any time after a transaction starts, preserves external con¬ 
sistency by an argument analogous to that presented for 


writes in Section 4.1.2 However, such a timestamp may 


require the execution of the data reads at s rea d to block 
if t sa f e has not advanced sufficiently. (In addition, note 
that choosing a value of s rea d may also advance s max to 
preserve disjointness.) To reduce the chances of block¬ 
ing, Spanner should assign the oldest timestamp that pre¬ 


serves external consistency. Section 4.2.2 explains how 
such a timestamp can be chosen. 


4.1.3 Serving Reads at a Timestamp 


The monotonicity invariant described in Section 4.1.2 al¬ 
lows Spanner to correctly determine whether a replica’s 
state is sufficiently up-to-date to satisfy a read. Every 
replica tracks a value called safe time t sa f e which is the 


4.2 Details 

This section explains some of the practical details of 
read-write transactions and read-only transactions elided 
earlier, as well as the implementation of a special trans¬ 
action type used to implement atomic schema changes. 
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It then describes some refinements of the basic schemes 
as described. 


4.2.1 Read-Write Transactions 


Like Bigtable, writes that occur in a transaction are 
buffered at the client until commit. As a result, reads 
in a transaction do not see the effects of the transaction’s 
writes. This design works well in Spanner because a read 
returns the timestamps of any data read, and uncommit¬ 
ted writes have not yet been assigned timestamps. 

Reads within read-write transactions use wound- 
wait |33 ] to avoid deadlocks. The client issues reads 
to the leader replica of the appropriate group, which 
acquires read locks and then reads the most recent 
data. While a client transaction remains open, it sends 
keepalive messages to prevent participant leaders from 
timing out its transaction. When a client has completed 
all reads and buffered all writes, it begins two-phase 
commit. The client chooses a coordinator group and 
sends a commit message to each participant’s leader with 
the identity of the coordinator and any buffered writes. 
Having the client drive two-phase commit avoids send¬ 
ing data twice across wide-area links. 

A non-coordinator-participant leader first acquires 
write locks. It then chooses a prepare timestamp that 
must be larger than any timestamps it has assigned to pre¬ 
vious transactions (to preserve monotonicity), and logs a 
prepare record through Paxos. Each participant then no¬ 
tifies the coordinator of its prepare timestamp. 

The coordinator leader also first acquires write locks, 
but skips the prepare phase. It chooses a timestamp for 
the entire transaction after hearing from all other partici¬ 
pant leaders. The commit timestamp s must be greater or 
equal to all prepare timestamps (to satisfy the constraints 


discussed in Section 4.1.3 1 , greater than TT.now{).latest 
at the time the coordinator received its commit message, 
and greater than any timestamps the leader has assigned 
to previous transactions (again, to preserve monotonic¬ 
ity). The coordinator leader then logs a commit record 
through Paxos (or an abort if it timed out while waiting 
on the other participants). 

Before allowing any coordinator replica to apply 
the commit record, the coordinator leader waits until 
TT.after(s), so as to obey the commit-wait rule described 
in Section l4.1.2l Because the coordinator leader chose s 
based on TT.nowQ.latest, and now waits until that time- 
stamp is guaranteed to be in the past, the expected wait 
is at least 2 * I. This wait is typically overlapped with 
Paxos communication. After commit wait, the coordi¬ 
nator sends the commit timestamp to the client and all 
other participant leaders. Each participant leader logs the 
transaction’s outcome through Paxos. All participants 
apply at the same timestamp and then release locks. 


4.2.2 Read-Only Transactions 

Assigning a timestamp requires a negotiation phase be¬ 
tween all of the Paxos groups that are involved in the 
reads. As a result. Spanner requires a scope expression 
for every read-only transaction, which is an expression 
that summarizes the keys that will be read by the entire 
transaction. Spanner automatically infers the scope for 
standalone queries. 

If the scope’s values are served by a single Paxos 
group, then the client issues the read-only transaction to 
that group’s leader. (The current Spanner implementa¬ 
tion only chooses a timestamp for a read-only transac¬ 
tion at a Paxos leader.) That leader assigns s rea d and ex¬ 
ecutes the read. For a single-site read. Spanner gener¬ 
ally does better than TT.nowf).latest. Define LastTS () to 
be the timestamp of the last committed write at a Paxos 
group. If there are no prepared transactions, the assign¬ 
ment s rea ,i = LastTS () trivially satisfies external consis¬ 
tency: the transaction will see the result of the last write, 
and therefore be ordered after it. 

If the scope’s values are served by multiple Paxos 
groups, there are several options. The most complicated 
option is to do a round of communication with all of 
the groups’s leaders to negotiate s rea d based on LastTSQ. 
Spanner currently implements a simpler choice. The 
client avoids a negotiation round, and just has its reads 
execute at s rea d = TT.nowf).latest (which may wait for 
safe time to advance). All reads in the transaction can be 
sent to replicas that are sufficiently up-to-date. 


4.2.3 Schema-Change Transactions 

TrueTime enables Spanner to support atomic schema 
changes. It would be infeasible to use a standard transac¬ 
tion, because the number of participants (the number of 
groups in a database) could be in the millions. Bigtable 
supports atomic schema changes in one datacenter, but 
its schema changes block all operations. 

A Spanner schema-change transaction is a generally 
non-blocking variant of a standard transaction. First, it 
is explicitly assigned a timestamp in the future, which 
is registered in the prepare phase. As a result, schema 
changes across thousands of servers can complete with 
minimal disruption to other concurrent activity. Sec¬ 
ond, reads and writes, which implicitly depend on the 
schema, synchronize with any registered schema-change 
timestamp at time t: they may proceed if their times¬ 
tamps precede t, but they must block behind the schema- 
change transaction if their timestamps are after t. With¬ 
out TrueTime, defining the schema change to happen at t 
would be meaningless. 
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replicas 

latency (ms) 

throughput (Kops/sec) 

write 

read-only transaction 

snapshot read 

write 

read-only transaction 

snapshot read 

ID 

9.4±.6 

— 

— 

4.0±.3 

— 

— 

1 

14.4±1.0 

1.4±.l 

1.3±.l 

4.1±.05 

10.9±.4 

13.5±.l 

3 

13.9T.6 

1.3±.l 

1.2±.l 

2.2±.5 

13.8±3.2 

38.5T.3 

5 

14.4T.4 

1.4T.05 

1.3T.04 

2.8±.3 

25.3±5.2 

50.0±1.1 


Table 3: Operation microbenchmarks. Mean and standard deviation over 10 runs. ID means one replica with commit wait disabled. 


4.2.4 Refinements 

t™ e as defined above has a weakness, in that a single 
prepared transaction prevents t sa f e from advancing. As 
a result, no reads can occur at later timestamps, even 
if the reads do not conflict with the transaction. Such 
false conflicts can be removed by augmenting t™ e with 
a fine-grained mapping from key ranges to prepared- 
transaction timestamps. This information can be stored 
in the lock table, which already maps key ranges to 
lock metadata. When a read arrives, it only needs to be 
checked against the fine-grained safe time for key ranges 
with which the read conflicts. 

LastTS() as defined above has a similar weakness: if 
a transaction has just committed, a non-conflicting read¬ 
only transaction must still be assigned s rea j so as to fol¬ 
low that transaction. As a result, the execution of the read 
could be delayed. This weakness can be remedied sim¬ 
ilarly by augmenting LastTS( ) with a fine-grained map¬ 
ping from key ranges to commit timestamps in the lock 
table. (We have not yet implemented this optimization.) 
When a read-only transaction arrives, its timestamp can 
be assigned by taking the maximum value of LastTS () 
for the key ranges with which the transaction conflicts, 
unless there is a conflicting prepared transaction (which 
can be determined from fine-grained safe time). 

t P S af° s as defined above has a weakness in that it cannot 
advance in the absence of Paxos writes. That is, a snap¬ 
shot read at t cannot execute at Paxos groups whose last 
write happened before t. Spanner addresses this problem 
by taking advantage of the disjointness of leader-lease 
intervals. Each Paxos leader advances t p s ^° s by keeping 
a threshold above which future writes’ timestamps will 
occur: it maintains a mapping MinNextTS(n) from Paxos 
sequence number n to the minimum timestamp that may 
be assigned to Paxos sequence number n + 1. A replica 
can advance t p “f° s to MinNextTS{n) — 1 when it has ap¬ 
plied through n. 

A single leader can enforce its MinNextTS () 
promises easily. Because the timestamps promised 
by MinNextTS() lie within a leader’s lease, the disjoint¬ 
ness invariant enforces MinNextTSf) promises across 
leaders. If a leader wishes to advance MinNextTS () 
beyond the end of its leader lease, it must first extend its 


lease. Note that s max is always advanced to the highest 
value in MinNextTSf) to preserve disjointness. 

A leader by default advances MinNextTS( ) values ev¬ 
ery 8 seconds. Thus, in the absence of prepared trans¬ 
actions, healthy slaves in an idle Paxos group can serve 
reads at timestamps greater than 8 seconds old in the 
worst case. A leader may also advance MinNextTS() val¬ 
ues on demand from slaves. 

5 Evaluation 

We first measure Spanner’s performance with respect to 
replication, transactions, and availability. We then pro¬ 
vide some data on TrueTime behavior, and a case study 
of our first client, FI. 

5.1 Microbenchmarks 

Table [3] presents some microbenchmarks for Spanner. 
These measurements were taken on timeshared ma¬ 
chines: each spanserver ran on scheduling units of 4GB 
RAM and 4 cores (AMD Barcelona 2200MHz). Clients 
were run on separate machines. Each zone contained one 
spanserver. Clients and zones were placed in a set of dat¬ 
acenters with network distance of less than 1ms. (Such a 
layout should be commonplace: most applications do not 
need to distribute all of their data worldwide.) The test 
database was created with 50 Paxos groups with 2500 di¬ 
rectories. Operations were standalone reads and writes of 
4KB. All reads were served out of memory after a com¬ 
paction, so that we are only measuring the overhead of 
Spanner’s call stack. In addition, one unmeasured round 
of reads was done first to warm any location caches. 

For the latency experiments, clients issued sufficiently 
few operations so as to avoid queuing at the servers. 
From the 1-replica experiments, commit wait is about 
5ms, and Paxos latency is about 9ms. As the number 
of replicas increases, the latency stays roughly constant 
with less standard deviation because Paxos executes in 
parallel at a group’s replicas. As the number of replicas 
increases, the latency to achieve a quorum becomes less 
sensitive to slowness at one slave replica. 

For the throughput experiments, clients issued suffi¬ 
ciently many operations so as to saturate the servers’ 
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participants 

latency (ms) 

mean 

99th percentile 

1 

17.0 ±1.4 

75.0 ±34.9 

2 

24.5 ±2.5 

87.6 ±35.9 

5 

31.5 ±6.2 

104.5 ±52.2 

10 

30.0 ±3.7 

95.6 ±25.4 

25 

35.5 ±5.6 

100.4 ±42.7 

50 

42.7 ±4.1 

93.7 ±22.9 

100 

71.4 ±7.6 

131.2 ±17.6 

200 

150.5 ±11.0 

320.3 ±35.1 


Table 4: Two-phase commit scalability. Mean and standard 
deviations over 10 runs. 
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CPUs. Snapshot reads can execute at any up-to-date 
replicas, so their throughput increases almost linearly 
with the number of replicas. Single-read read-only trans¬ 
actions only execute at leaders because timestamp as¬ 
signment must happen at leaders. Read-only-transaction 
throughput increases with the number of replicas because 
the number of effective spanservers increases: in the 
experimental setup, the number of spanservers equaled 
the number of replicas, and leaders were randomly dis¬ 
tributed among the zones. Write throughput benefits 
from the same experimental artifact (which explains the 
increase in throughput from 3 to 5 replicas), but that ben¬ 
efit is outweighed by the linear increase in the amount of 
work performed per write, as the number of replicas in¬ 
creases. 

Table[4]demonstrates that two-phase commit can scale 
to a reasonable number of participants: it summarizes 
a set of experiments run across 3 zones, each with 25 
spanservers. Scaling up to 50 participants is reasonable 
in both mean and 99th-percentile, and latencies start to 
rise noticeably at 100 participants. 

5.2 Availability 

Figure [5] illustrates the availability benefits of running 
Spanner in multiple datacenters. It shows the results of 
three experiments on throughput in the presence of dat¬ 
acenter failure, all of which are overlaid onto the same 
time scale. The test universe consisted of 5 zones 
each of which had 25 spanservers. The test database was 
sharded into 1250 Paxos groups, and 100 test clients con¬ 
stantly issued non-snapshot reads at an aggregrate rate 
of 50K reads/second. All of the leaders were explic¬ 
itly placed in Z\. Five seconds into each test, all of 
the servers in one zone were killed: non-leader kills Z>\ 
leader-hard kills Z \; leader-soft kills Z \, but it gives no¬ 
tifications to all of the servers that they should handoff 
leadership first. 

Killing Zi has no effect on read throughput. Killing 
Z\ while giving the leaders time to handoff leadership to 


Figure 5: Effect of killing servers on throughput. 

a different zone has a minor effect: the throughput drop 
is not visible in the graph, but is around 3-4%. On the 
other hand, killing Z\ with no warning has a severe ef¬ 
fect: the rate of completion drops almost to 0. As leaders 
get re-elected, though, the throughput of the system rises 
to approximately 100K reads/second because of two ar¬ 
tifacts of our experiment: there is extra capacity in the 
system, and operations are queued while the leader is un¬ 
available. As a result, the throughput of the system rises 
before leveling off again at its steady-state rate. 

We can also see the effect of the fact that Paxos leader 
leases are set to 10 seconds. When we kill the zone, 
the leader-lease expiration times for the groups should 
be evenly distributed over the next 10 seconds. Soon af¬ 
ter each lease from a dead leader expires, a new leader is 
elected. Approximately 10 seconds after the kill time, all 
of the groups have leaders and throughput has recovered. 
Shorter lease times would reduce the effect of server 
deaths on availability, but would require greater amounts 
of lease-renewal network traffic. We are in the process of 
designing and implementing a mechanism that will cause 
slaves to release Paxos leader leases upon leader failure. 

5.3 TrueTime 

Two questions must be answered with respect to True- 
Time: is e truly a bound on clock uncertainty, and how 
bad does e get? For the former, the most serious prob¬ 
lem would be if a local clock’s drift were greater than 
200us/sec: that would break assumptions made by True- 
Time. Our machine statistics show that bad CPUs are 6 
times more likely than bad clocks. That is, clock issues 
are extremely infrequent, relative to much more serious 
hardware problems. As a result, we believe that True- 
Time’s implementation is as trustworthy as any other 
piece of software upon which Spanner depends. 

Figure |6]presents TrueTime data taken at several thou¬ 
sand spanserver machines across datacenters up to 2200 
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Mar 29 Mar 30 Mar 31 Apr 16AM 8AM 10AM 12PM 

Date Date (April 13) 


# fragments 

# directories 

1 

> 100M 

2-4 

341 

5-9 

5336 

10-14 

232 

15-99 

34 

100-500 

7 


Table 5: Distribution of directory-fragment counts in FI 


Figure 6: Distribution of TrueTime e values, sampled right 
after timeslave daemon polls the time masters. 90th, 99th, and 
99.9th percentiles are graphed. 


km apart. It plots the 90th, 99th, and 99.9th percentiles 
of e, sampled at timeslave daemons immediately after 
polling the time masters. This sampling elides the saw¬ 
tooth in e due to local-clock uncertainty, and therefore 
measures time-master uncertainty (which is generally 0) 
plus communication delay to the time masters. 

The data shows that these two factors in determining 
the base value of e are generally not a problem. How¬ 
ever, there can be significant tail-latency issues that cause 
higher values of e. The reduction in tail latencies begin¬ 
ning on March 30 were due to networking improvements 
that reduced transient network-link congestion. The in¬ 
crease in e on April 13, approximately one hour in dura¬ 
tion, resulted from the shutdown of 2 time masters at a 
datacenter for routine maintenance. We continue to in¬ 
vestigate and remove causes of TrueTime spikes. 


5.4 FI 


Spanner started being experimentally evaluated under 
production workloads in early 2011, as part of a rewrite 


of Google’s advertising backend called FI [35]. This 
backend was originally based on a MySQL database that 
was manually sharded many ways. The uncompressed 
dataset is tens of terabytes, which is small compared to 
many NoSQL instances, but was large enough to cause 
difficulties with sharded MySQL. The MySQL sharding 
scheme assigned each customer and all related data to a 
fixed shard. This layout enabled the use of indexes and 
complex query processing on a per-customer basis, but 
required some knowledge of the sharding in application 
business logic. Resharding this revenue-critical database 
as it grew in the number of customers and their data was 
extremely costly. The last resharding took over two years 
of intense effort, and involved coordination and testing 
across dozens of teams to minimize risk. This operation 
was too complex to do regularly: as a result, the team had 
to limit growth on the MySQL database by storing some 


data in external Bigtables, which compromised transac¬ 
tional behavior and the ability to query across all data. 

The FI team chose to use Spanner for several rea¬ 
sons. First, Spanner removes the need to manually re¬ 
shard. Second, Spanner provides synchronous replica¬ 
tion and automatic failover. With MySQL master-slave 
replication, failover was difficult, and risked data loss 
and downtime. Third, FI requires strong transactional 
semantics, which made using other NoSQL systems im¬ 
practical. Application semantics requires transactions 
across arbitrary data, and consistent reads. The FI team 
also needed secondary indexes on their data (since Span¬ 
ner does not yet provide automatic support for secondary 
indexes), and was able to implement their own consistent 
global indexes using Spanner transactions. 

All application writes are now by default sent through 
FI to Spanner, instead of the MySQL-based application 
stack. FI has 2 replicas on the west coast of the US, and 
3 on the east coast. This choice of replica sites was made 
to cope with outages due to potential major natural disas¬ 
ters, and also the choice of their frontend sites. Anecdo¬ 
tally, Spanner’s automatic failover has been nearly invisi¬ 
ble to them. Although there have been unplanned cluster 
failures in the last few months, the most that the FI team 
has had to do is update their database’s schema to tell 
Spanner where to preferentially place Paxos leaders, so 
as to keep them close to where their frontends moved. 

Spanner’s timestamp semantics made it efficient for 
FI to maintain in-memory data structures computed from 
the database state. FI maintains a logical history log of 
all changes, which is written into Spanner itself as part 
of every transaction. FI takes full snapshots of data at a 
timestamp to initialize its data structures, and then reads 
incremental changes to update them. 

Table 0 illustrates the distribution of the number of 
fragments per directory in FI. Each directory typically 
corresponds to a customer in the application stack above 
FI. The vast majority of directories (and therefore cus¬ 
tomers) consist of only 1 fragment, which means that 
reads and writes to those customers’ data are guaranteed 
to occur on only a single server. The directories with 
more than 100 fragments are all tables that contain FI 
secondary indexes: writes to more than a few fragments 
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operation 

latency (ms) 

count 

mean 

std dev 

all reads 

8.7 

376.4 

21.5B 

single-site commit 

72.3 

112.8 

31.2M 

multi-site commit 

103.0 

52.2 

32.1M 


Table 6: Fl-perceived operation latencies measured over the 
course of 24 hours. 

of such tables are extremely uncommon. The FI team 
has only seen such behavior when they do untuned bulk 
data loads as transactions. 

Table [6] presents Spanner operation latencies as mea¬ 
sured from FI servers. Replicas in the east-coast data 
centers are given higher priority in choosing Paxos lead¬ 
ers. The data in the table is measured from FI servers 
in those data centers. The large standard deviation in 
write latencies is caused by a pretty fat tail due to lock 
conflicts. The even larger standard deviation in read la¬ 
tencies is partially due to the fact that Paxos leaders are 
spread across two data centers, only one of which has 
machines with SSDs. In addition, the measurement in¬ 
cludes every read in the system from two datacenters: 
the mean and standard deviation of the bytes read were 
roughly 1.6KB and 119KB, respectively. 

6 Related Work 

Consistent replication across datacenters as a storage 
service has been provided by Megastore 0 and Dy- 
namoDB |3J. DynamoDB presents a key-value interface, 
and only replicates within a region. Spanner follows 
Megastore in providing a semi-relational data model, 
and even a similar schema language. Megastore does 
not achieve high performance. It is layered on top of 
Bigtable, which imposes high communication costs. It 
also does not support long-lived leaders: multiple repli¬ 
cas may initiate writes. All writes from different repli¬ 
cas necessarily conflict in the Paxos protocol, even if 
they do not logically conflict: throughput collapses on 
a Paxos group at several writes per second. Spanner pro¬ 
vides higher performance, general-purpose transactions, 
and external consistency. 

Pavlo et al. 0 have compared the performance of 
databases and MapReduce 0 - They point to several 
other efforts that have been made to explore database 
functionality layered on distributed key-value stores jl] 
000 as evidence that the two worlds are converging. 
We agree with the conclusion, but demonstrate that in¬ 
tegrating multiple layers has its advantages: integrating 
concurrency control with replication reduces the cost of 
commit wait in Spanner, for example. 


The notion of layering transactions on top of a repli¬ 
cated store dates at least as far back as Gifford’s disser¬ 
tation ED- Scatter 0 is a recent DHT-based key-value 
store that layers transactions on top of consistent repli¬ 
cation. Spanner focuses on providing a higher-level in¬ 
terface than Scatter does. Gray and Lamport 01 de¬ 
scribe a non-blocking commit protocol based on Paxos. 
Their protocol incurs more messaging costs than two- 
phase commit, which would aggravate the cost of com¬ 
mit over widely distributed groups. Walter (36) provides 
a variant of snapshot isolation that works within, but not 
across datacenters. In contrast, our read-only transac¬ 
tions provide a more natural semantics, because we sup¬ 
port external consistency over all operations. 

There has been a spate of recent work on reducing 
or eliminating locking overheads. Calvin [|40j elimi¬ 
nates concurrency control: it pre-assigns timestamps and 
then executes the transactions in timestamp order. H- 
Store 1391 and Granola 0 each supported their own 
classification of transaction types, some of which could 
avoid locking. None of these systems provides external 
consistency. Spanner addresses the contention issue by 
providing support for snapshot isolation. 

VoltDB (42) is a sharded in-memory database that 
supports master-slave replication over the wide area for 
disaster recovery, but not more general replication con¬ 
figurations. It is an example of what has been called 
NewSQL, which is a marketplace push to support scal¬ 
able SQL (38). A number of commercial databases im¬ 
plement reads in the past, such as MarkLogic (26) and 
Oracle’s Total Recall [30|. Lomet and Li [24| describe an 
implementation strategy for such a temporal database. 

Farsite derived bounds on clock uncertainty (much 
looser than TrueTime’s) relative to a trusted clock refer¬ 
ence 0: server leases in Farsite were maintained in the 
same way that Spanner maintains Paxos leases. Loosely 
synchronized clocks have been used for concurrency- 
control purposes in prior work @ 0 - We have shown 
that TrueTime lets one reason about global time across 
sets of Paxos state machines. 

7 Future Work 

We have spent most of the last year working with the 
FI team to transition Google’s advertising backend from 
MySQL to Spanner. We are actively improving its mon¬ 
itoring and support tools, as well as tuning its perfor¬ 
mance. In addition, we have been working on improving 
the functionality and performance of our backup/restore 
system. We are currently implementing the Spanner 
schema language, automatic maintenance of secondary 
indices, and automatic load-based resharding. Longer 
term, there are a couple of features that we plan to in- 
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vestigate. Optimistically doing reads in parallel may be 
a valuable strategy to pursue, but initial experiments have 
indicated that the right implementation is non-trivial. In 
addition, we plan to eventually support direct changes of 
Paxos configurations |22||34| . 

Given that we expect many applications to replicate 
their data across datacenters that are relatively close to 
each other, TrueTime e may noticeably affect perfor¬ 
mance. We see no insurmountable obstacle to reduc¬ 
ing e below 1ms. Time-master-query intervals can be 
reduced, and better clock crystals are relatively cheap. 
Time-master query latency could be reduced with im¬ 
proved networking technology, or possibly even avoided 
through alternate time-distribution technology. 

Finally, there are obvious areas for improvement. Al¬ 
though Spanner is scalable in the number of nodes, the 
node-local data structures have relatively poor perfor¬ 
mance on complex SQL queries, because they were de¬ 
signed for simple key-value accesses. Algorithms and 
data structures from DB literature could improve single¬ 
node performance a great deal. Second, moving data au¬ 
tomatically between datacenters in response to changes 
in client load has long been a goal of ours, but to make 
that goal effective, we would also need the ability to 
move client-application processes between datacenters in 
an automated, coordinated fashion. Moving processes 
raises the even more difficult problem of managing re¬ 
source acquisition and allocation between datacenters. 

8 Conclusions 

To summarize. Spanner combines and extends on ideas 
from two research communities: from the database com¬ 
munity, a familiar, easy-to-use, semi-relational interface, 
transactions, and an SQL-based query language; from 
the systems community, scalability, automatic sharding, 
fault tolerance, consistent replication, external consis¬ 
tency, and wide-area distribution. Since Spanner’s in¬ 
ception, we have taken more than 5 years to iterate to the 
current design and implementation. Part of this long it¬ 
eration phase was due to a slow realization that Spanner 
should do more than tackle the problem of a globally- 
replicated namespace, and should also focus on database 
features that Bigtable was missing. 

One aspect of our design stands out: the linchpin of 
Spanner’s feature set is TrueTime. We have shown that 
reifying clock uncertainty in the time API makes it possi¬ 
ble to build distributed systems with much stronger time 
semantics. In addition, as the underlying system en¬ 
forces tighter bounds on clock uncertainty, the overhead 
of the stronger semantics decreases. As a community, we 
should no longer depend on loosely synchronized clocks 
and weak time APIs in designing distributed algorithms. 
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A Paxos Leader-Lease Management 

The simplest means to ensure the disjointness of Paxos- 
leader-lease intervals would be for a leader to issue a syn¬ 
chronous Paxos write of the lease interval, whenever it 
would be extended. A subsequent leader would read the 
interval and wait until that interval has passed. 

TrueTime can be used to ensure disjointness without 
these extra log writes. The potential ?'th leader keeps a 
lower bound on the start of a lease vote from replica r as 
v[ e “ der = TT.now{).earliest, computed before ef" d (de¬ 
fined as when the lease request is sent by the leader). 
Each replica r grants a lease at lease ef™ r , which hap¬ 
pens after e r -'"' Ave (when the replica receives a lease re¬ 
quest); the lease ends at t e " d = TT.now().latest + 10, 
computed after ef c r e,ve . A replica r obeys the single¬ 
vote rule: it will not grant another lease vote until 
TT.after[t™ d ) is true. To enforce this rule across different 
incarnations of r. Spanner logs a lease vote at the grant¬ 
ing replica before granting the lease; this log write can 
be piggybacked upon existing Paxos-protocol log writes. 

When the ith leader receives a quorum of votes 
(event e j uolwn ^ it computes its lease interval as 
leasei = [TT.nowQ.latest, min r (v l i e !f Ier ) + 10]. The 
lease is deemed to have expired at the leader when 
TT.before(min r (v l f“ der ) + 10) is false. To prove disjoint¬ 
ness, we make use of the fact that the ith and (i + l)th 
leaders must have one replica in common in their quo¬ 
rums. Call that replica r0. Proof: 

leasei.end = + 10 

min r {v[ e y er ) + 10 < + 10 

<4“o r + 10 < U s «"o) + io 

tabs(ef" r d 0 ) + 10 < W(e^r) + 10 
j. / receive \ , -i c\ ^ +end 

Iabs\£i,rO J “r 1U S G,r0 

tfro < MefTiU) 

WefTto) < U(eST) 

< lease i+1 .start 


(by definition) 
(min) 
(by definition) 
(causality) 
(by definition) 
(single-vote) 
(causality) 
(by definition) 
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